Partitioning strategies for distributed association rule mining

نویسندگان

  • Frans Coenen
  • Paul H. Leng
چکیده

In this paper a number of alternative strategies for distributed/parallel association rule mining are investigated. The methods examined make use of a data structure, the T-tree, introduced previously by the authors as a structure for organising sets of attributes for which support is being counted. We consider six different approaches, representing different ways of parallelising the basic Apriori-T algorithm that we use. The methods focus on different mechanisms for partitioning the data between processes, and for reducing the message-passing overhead. Both ‘horizontal’ (data distribution) and ‘vertical’ (candidate distribution) partitioning strategies are considered, including a vertical partitioning algorithm (DATA-VP) which we have developed to exploit the structure of the T-tree. We present experimental results examining the performance of the methods in implementations using JavaSpaces. We conclude that in a JavaSpaces environment, candidate distribution strategies offer better performance than those that distribute the original dataset, because of the lower messaging overhead, and the DATA-VP algorithm produced results that are especially encouraging.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Optimization of Distributed Association Rule Mining Based Partial Vertical Partitioning

Association rule mining is a one of the most important technique in data mining. Data mining is the process of analyzing data from different angles & getting useful information about data. Modern organizations are geographically distributed. Using the traditional centralized association rule mining to discover useful patterns in such distributed system is not always feasible because merging dat...

متن کامل

A Novel Data Partitioning Approach for Association Rule Mining on Grids

Mining association rules refers to extracting useful knowledge from large databases. Algorithms of this technique are both data and computation-intensive, which make grid platforms very attractive for them. However, to exploit these platforms, new data partitioning features are required where the specificities of both association rule mining technique and grids must be taken into consideration....

متن کامل

Parallel Rule Mining with Dynamic Data Distribution under Heterogeneous Cluster Environment

Big data mining methods supports knowledge discovery on high scalable, high volume and high velocity data elements. The cloud computing environment provides computational and storage resources for the big data mining process. Hadoop is a widely used parallel and distributed computing platform for big data analysis and manages the homogeneous and heterogeneous computing models. The MapReduce fra...

متن کامل

Design and Analysis of a Dynamic Load Balancing Strategy for Large-Scale Distributed Association Rule Mining

Association rule mining is one of the most important data mining techniques. Algorithms of this technique search a large space, considering numerous different alternatives and scanning the data repeatedly. Parallelism seems to be the natural solution in order to be able to work with industrial-sized databases. Large-scale computing systems, such as Grid computing environments, are recently rega...

متن کامل

Fuzzy Associative Classifier for Distributed Mining

Distributed data mining extracts the knowledge from distributed data sources without considering their physical location. The need for such systems arises from the fact that, in real time many data bases are distributed geographically in different locations.Often transferring data produced at local sites to centralized site for extracting knowledge results in excessive time and transmission cos...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • Knowledge Eng. Review

دوره 21  شماره 

صفحات  -

تاریخ انتشار 2006